Abstract

It is possible to approach regression analysis with random covariates
from a semiparametric perspective where information is combined
from multiple multivariate sources. The approach assumes a
semiparametric density ratio model where multivariate distributions are
"regressed" on a reference distribution. Each multivariate distribution
and a corresponding conditional expectation (regression) of interest is
then estimated from the combined data from all sources. Graphical
and quantitative diagnostic tools are suggested to assess model
validity. The method is applied in quantifying the effect of age on weight of
germ cell testicular cancer patients. Comparisons are made with both
multiple regression and nonparametric kernel regression.

1 Introduction

The purpose of this paper is to address the relationship between weight, 
height, and age of germ cell testicular cancer patients. We approach this 
problem through a nonlinear regression method based on the density ratio 
model. The method points to the importance of the inclusion of age as a 
covariate in the prediction of weight, as expressed by a significant reduction 
in mean square error (MSE) and mean absolute error (MAE), in both case 
and control groups. This effect is not clearly discernible in a straightforward 
application of multiple regression. 

Before the actual application to germ cell testicular data in Section 4, we 
describe the method and some of its underpinnings, and suggest appropriate 
graphical and quantitative diagnostic tools in Section 2. To gain insight into 
the potential of the approach and the use of the diagnostic tools, we report 
simulation results in Section 3. 

1.1 Background and Preliminaries 

Suppose we have m = q + 1 data sources, such as q case groups and a 
control group, each giving a sample of random vectors from an unknown 
multivariate distribution. Assume that each vector consists of a random 
response and its random covariates. Given this setup, the semiparametric
multivariate density ratio model can be used to provide an alternative to
classical regression with random covariates, as well as to kernel
nonparametric regression. This approach falls under the general rubric of
fusion or integration of information from multiple sources, and it does not
depend on the normality assumption.

In the density ratio model one distribution serves as a reference or base- 
line, and all other distributions are exponential tilts of the reference. In 
its one dimensional form the model is motivated by the classical one-way 
analysis of variance with m = q + 1 independent normal random samples, 
and logistic regression (Fokianos et al 2001, Qin and Zhang 1997). In its 
multivariate form, the model is motivated by classical classification given 
multivariate normal samples, and generalized logistic regression (Anderson 
1971, Prentice and Pyke 1979). 

Formally, in the one-dimensional case there are m = q + 1 random samples,
$(x_{11},\ldots,x_{1n_1}),\ldots,(x_{q1},\ldots,x_{qn_q}),(x_{m1},\ldots,x_{mn_m})$,
with probability density functions $g_i$,

$$x_{ij} \sim g_i, \qquad i = 1,\ldots,q,m, \quad j = 1,\ldots,n_i, \tag{1}$$

where $g_m \equiv g$ is called the reference probability density, and where the
$g_j$ satisfy the density ratio model

$$\frac{g_j(x)}{g(x)} = \exp(\alpha_j + \beta_j' h(x)), \qquad j = 1,\ldots,q. \tag{2}$$

Assuming that the distortion function $h(x)$ is a known vector-valued
function, the objective is to estimate the reference density $g$ and the
parameters $\alpha_j$, $\beta_j$ from the combined data

$$t = (x_{11},\ldots,x_{1n_1},\ldots,x_{q1},\ldots,x_{qn_q},x_{m1},\ldots,x_{mn_m})'. \tag{3}$$
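
The normal motivation for the model can be checked numerically. The sketch below assumes the simplest setting, two univariate normal densities with a common variance and $h(x) = x$; the function names and the numerical values are illustrative and are not taken from the paper. In that setting the log density ratio is exactly linear in $x$, with $\beta_j = (\mu_j - \mu)/\sigma^2$ and $\alpha_j = (\mu^2 - \mu_j^2)/(2\sigma^2)$.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def tilt_parameters(mu_j, mu_ref, sigma):
    """Return (alpha_j, beta_j) with g_j(x)/g(x) = exp(alpha_j + beta_j * x)
    for g_j = N(mu_j, sigma^2) and reference g = N(mu_ref, sigma^2)."""
    beta = (mu_j - mu_ref) / sigma ** 2
    alpha = (mu_ref ** 2 - mu_j ** 2) / (2.0 * sigma ** 2)
    return alpha, beta

# Illustrative (hypothetical) values, not from the paper.
mu_j, mu_ref, sigma = 1.0, 0.0, 1.5
alpha, beta = tilt_parameters(mu_j, mu_ref, sigma)

# Largest gap between the exact log ratio and the linear tilt on a small grid.
max_gap = max(
    abs(math.log(normal_pdf(x, mu_j, sigma) / normal_pdf(x, mu_ref, sigma))
        - (alpha + beta * x))
    for x in [-3.0, -1.0, 0.0, 0.5, 2.0, 4.0]
)
print(alpha, beta, max_gap)
```

With unequal variances the log ratio also contains a quadratic term, which is one reason the choice of $h(x)$ matters.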

The density ratio model has been applied in various problems includ- 
ing kernel density estimation (Fokianos 2004, Cheng and Chu 2004, Qin 
and Zhang 2005), analysis of variance (Fokianos et al 2001), AIDS vaccine
trials (Gilbert et al 1999), mortality rate prediction (Kedem et al 2008), 
case-control studies (Prentice and Pyke 1979, Qin 1998), logistic model val- 
idation (Qin and Zhang 1997), cluster detection (Wen and Kedem 2009), 
and goodness of fit (Zhang 2000). A two-dimensional case-control applica- 
tion has been made recently in Kedem et al (2009). 

In this paper the L-dimensional formulation of the model is used in 
the estimation of the conditional expectation of a response given covari- 
ate information. Specifically, for each of the m data sources, we use the 
L-dimensional density ratio model in predicting, via the estimated condi- 
tional expectation, the response variable given the corresponding covariate 
information, and propose measures of goodness of fit and diagnostic plots 
to check the validity of the model. A comparison with linear multiple
regression and the Nadaraya-Watson kernel nonparametric regression is made
using both real and simulated data.

1.2 Motivation 

The L-dimensional formulation of the model was motivated by an extension
of a previous analysis of two risk factors, body weight and height, of germ
cell testicular cancer to include three or more risk factors or covariates
(Kedem et al. 2009). We wanted to include age in the analysis with height
and weight because age is both an important risk factor and a potential
confounder: the incidence of testicular cancer varies by age, peaking around
25-35 years for the most common types of testicular cancer, and age
correlates with body weight (McGlynn and Cook 2010, Ogden et al. 2004). The
use of a two-dimensional density ratio model in the previous analysis
uncovered an important contribution of body weight in the presence of height
that was not observed in logistic regression analyses (McGlynn et al. 2006).
The proposed extension of the density ratio model provides an opportunity to
explore the interrelationships of height and weight with testicular cancer
while controlling for age by estimating the conditional expectation of weight
given height and age.

2 Statistical Formulation 

2.1 The L-Dimensional Density Ratio Model 

Suppose we have m = q + 1 data sets or samples of L-dimensional vectors,
where each vector consists of L - 1 covariates and one response, and assume
that the ith sample size is $n_i$. Thus, for $i = 1,\ldots,q,m$,
$j = 1,\ldots,n_i$ we have

$$(x_{ij1}, x_{ij2}, \ldots, x_{ij(L-1)}, y_{ij}) \sim g_i(x_1,\ldots,x_{L-1},y).$$

We choose $g \equiv g_m(x_1,\ldots,x_{L-1},y)$ as a reference or baseline
probability density function (pdf), and let each $g_i(x_1,\ldots,x_{L-1},y)$,
$i = 1,\ldots,q$, be an exponential distortion or tilt of the reference
distribution,

$$\frac{g_i(\mathbf{x})}{g(\mathbf{x})} = \exp(\alpha_i + \beta_i'\mathbf{x}), \qquad i = 1,\ldots,q, \tag{4}$$

where $\mathbf{x} = (x_1,\ldots,x_{L-1},y)'$ and
$\beta_i = (\beta_{i1},\ldots,\beta_{iL})'$. Since the $g_i(\mathbf{x})$,
$i = 1,\ldots,q,m$, are probability densities, $\beta_i = 0$ implies
$\alpha_i = 0$, $i = 1,\ldots,q$. It follows that the hypothesis
$H_0: \beta_1 = \cdots = \beta_q = 0$ implies equidistribution:
all the $g_i$ are equal. Model (4) is referred to as a density ratio model.

To estimate the parameters and the reference g, or equivalently the
reference distribution function G, we follow the procedure described in
Fokianos et al (2001), Qin and Zhang (1997), and Qin (1998). First the data
are combined in a single vector t of length $n = n_1 + n_2 + \cdots + n_m$,

$$t = \big((x_{ij1}, x_{ij2},\ldots,x_{ij(L-1)},y_{ij}) : i = 1,\ldots,q,m,\ j = 1,\ldots,n_i\big)' = (t_1',\ldots,t_n')' \tag{5}$$

where $t_i = (t_{ix_1},\ldots,t_{ix_{L-1}},t_{iy})'$. The idea is to approximate
the reference distribution function by a step function G (the same notation
is used) with jumps $p_i$ at all the observed points (Vardi 1982, 1985). For
the three-dimensional case:

$$p_i = G(t_{ix_1}, t_{ix_2}, t_{iy}) - G(t_{i-1,x_1}, t_{ix_2}, t_{iy}) - G(t_{ix_1}, t_{i-1,x_2}, t_{iy}) + G(t_{ix_1}, t_{i-1,x_2}, t_{i-1,y}) - G(t_{i-1,x_1}, t_{i-1,x_2}, t_{i-1,y}), \qquad i = 1,\ldots,n.$$

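Thus, the $p_i$ are the jumps in the L-dimensional step function G at
$t_1,\ldots,t_n$. The empirical likelihood (cf. Owen 2001) is then a function
of the $p_i$, $\alpha = (\alpha_1,\ldots,\alpha_q)'$ and
$\beta = (\beta_1',\ldots,\beta_q')'$:

$$L(\alpha,\beta,G) = \prod_{i=1}^{n} p_i \prod_{k=1}^{n_1}\exp(\alpha_1 + \beta_{11}x_{1k1} + \cdots + \beta_{1(L-1)}x_{1k(L-1)} + \beta_{1L}y_{1k}) \cdots \prod_{k=1}^{n_q}\exp(\alpha_q + \beta_{q1}x_{qk1} + \cdots + \beta_{q(L-1)}x_{qk(L-1)} + \beta_{qL}y_{qk}) \tag{6}$$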

subject to the constraints

$$\sum_{i=1}^{n} p_i = 1, \quad \sum_{i=1}^{n} w_1(t_i)p_i = 1, \ \ldots, \ \sum_{i=1}^{n} w_q(t_i)p_i = 1 \tag{7}$$

where

$$w_j(t_i) = \exp(\alpha_j + \beta_j' t_i), \qquad j = 1,\ldots,q.$$

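Estimates $\hat\alpha_j$ and $\hat\beta_j$ are obtained by solving the score
equations:

$$\frac{\partial l}{\partial\alpha_j} = -\sum_{i=1}^{n}\frac{\rho_j w_j(t_i)}{1 + \rho_1 w_1(t_i) + \cdots + \rho_q w_q(t_i)} + n_j = 0 \tag{8}$$

$$\frac{\partial l}{\partial\beta_j} = -\sum_{i=1}^{n}\frac{\rho_j w_j(t_i)\,t_i}{1 + \rho_1 w_1(t_i) + \cdots + \rho_q w_q(t_i)} + \sum_{k=1}^{n_j}(x_{jk1},\ldots,x_{jk(L-1)},y_{jk})' = 0 \tag{9}$$

for $j = 1,\ldots,q$ and $\rho_j = n_j/n_m$. Then

$$\hat p_i = \frac{1}{n_m}\cdot\frac{1}{1 + \rho_1\hat w_1(t_i) + \cdots + \rho_q\hat w_q(t_i)} \tag{10}$$

$$\hat G(t) = \frac{1}{n_m}\sum_{i=1}^{n}\frac{I(t_i \le t)}{1 + \sum_{j=1}^{q}\rho_j\hat w_j(t_i)} \tag{11}$$

where $(t_i \le t)$ is defined pointwise (component by component),
$\hat w_j(t_i) = \exp(\hat\alpha_j + \hat\beta_j' t_i)$, and $I(B)$ is
the indicator of the event B. It can be shown that the estimators
$\hat\theta = (\hat\alpha_1,\ldots,\hat\alpha_q,\hat\beta_1',\ldots,\hat\beta_q')'$
are asymptotically normal,

$$\sqrt{n}(\hat\theta - \theta_0) \to N(0, \Sigma) \tag{12}$$

as $n \to \infty$, where $\theta_0$ denotes the true parameters and
$\Sigma = S^{-1}VS^{-1}$ is defined in the appendix.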
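
The estimation steps can be sketched in code for the simplest configuration. The example below assumes $m = 2$ (one case sample and one reference sample), scalar observations, and Newton-Raphson iteration on the two score equations; the function name `fit_density_ratio`, the simulated data, and the choice of Newton-Raphson are assumptions made for illustration, not the authors' implementation. It then forms the jumps $p_i$ and checks that they satisfy the two constraints numerically.

```python
import math
import random

def fit_density_ratio(case, control, iters=50, tol=1e-10):
    """Newton-Raphson for (alpha, beta) when q = 1 and the data are scalar.

    `control` plays the role of the reference sample. Returns
    (alpha, beta, p, w): the estimates, the jumps p_i of the reference
    step function at the pooled points, and the tilts w(t_i).
    """
    t = list(case) + list(control)          # combined data vector
    n1, n2 = len(case), len(control)
    rho = n1 / n2
    sum_case = sum(case)
    alpha, beta = 0.0, 0.0
    for _ in range(iters):
        # pi_i = rho * w(t_i) / (1 + rho * w(t_i))
        pi = []
        for ti in t:
            rw = rho * math.exp(alpha + beta * ti)
            pi.append(rw / (1.0 + rw))
        # Score vector (derivatives with respect to alpha and beta)
        s0 = n1 - sum(pi)
        s1 = sum_case - sum(p_i * ti for p_i, ti in zip(pi, t))
        # Negative Hessian, a symmetric 2 x 2 matrix
        v = [p_i * (1.0 - p_i) for p_i in pi]
        h00 = sum(v)
        h01 = sum(vi * ti for vi, ti in zip(v, t))
        h11 = sum(vi * ti * ti for vi, ti in zip(v, t))
        det = h00 * h11 - h01 * h01
        d_alpha = (h11 * s0 - h01 * s1) / det
        d_beta = (h00 * s1 - h01 * s0) / det
        alpha += d_alpha
        beta += d_beta
        if abs(d_alpha) + abs(d_beta) < tol:
            break
    w = [math.exp(alpha + beta * ti) for ti in t]
    p = [1.0 / (n2 * (1.0 + rho * wi)) for wi in w]
    return alpha, beta, p, w

# Simulated (hypothetical) data: case ~ N(1, 1), reference ~ N(0, 1).
random.seed(0)
case = [random.gauss(1.0, 1.0) for _ in range(150)]
control = [random.gauss(0.0, 1.0) for _ in range(200)]
alpha, beta, p, w = fit_density_ratio(case, control)

print("alpha, beta:", alpha, beta)
print("sum p_i:", sum(p))
print("sum w_i p_i:", sum(wi * pi for wi, pi in zip(w, p)))
```

For these two normal populations the true tilt is $\beta = 1$, $\alpha = -0.5$, so the printed estimates can be compared against those values; the two sums should both be close to 1 once the iteration has converged.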

Notice that $\hat g_j(t_i)$, $j = 1,\ldots,q$, can be estimated as exponential
tilts of the $\hat p_i$. Thus, under the L-dimensional density ratio model we
can predict the response y given the covariate information
$x_1, x_2, \ldots, x_{L-1}$ for any of the m data sets as follows:

$$\hat E_j(y \mid x_1,\ldots,x_{L-1}) = \sum_{i=1}^{n} y_i\,\frac{\hat g_j(x_1,\ldots,x_{L-1},y_i)}{\sum_{l=1}^{n}\hat g_j(x_1,\ldots,x_{L-1},y_l)}, \qquad j = 1,\ldots,q,m. \tag{13}$$
The $\hat g_j$ are kernel density estimates in the sense of Fokianos (2004),

$$\hat g_j(z_0) = \frac{1}{h^L}\sum_{i=1}^{n}\hat p_i\,\hat w_j(t_i)\,K\big((t_i - z_0)/h\big), \qquad j = 1,\ldots,m, \tag{14}$$

where $z_0$ is L-dimensional (with $\hat w_m \equiv 1$ for the reference).
From Fokianos (2004) the asymptotic mean integrated square error (AMISE) of
$\hat g_j$ converges to 0 as $n \to \infty$, $h \to 0$ and $nh^L \to \infty$.
This implies that $\hat E_j(y \mid x_1,\ldots,x_{L-1})$ is a consistent
estimator of $E(y \mid x_1,\ldots,x_{L-1})$ in the jth population in the sense
of convergence in probability, at least for bounded data (see the appendix).

As in the Nadaraya-Watson kernel estimate (Nadaraya 1964, Watson
1964), the estimated conditional expectation (13) is of the form
$\sum_i w_i y_i$, where the $w_i$ are positive weights which sum to 1, except
that here the $w_i$ also depend on the $y_i$.
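
The weighted-average form of the prediction can be illustrated with a short sketch. It assumes a product Gaussian kernel and a single common bandwidth, and it takes the jumps `p` and tilts `w` as given inputs; the kernel choice, the function names, and the toy data are assumptions for illustration only.

```python
import math

def gaussian_kernel(u):
    """Product Gaussian kernel evaluated at the vector u."""
    return math.exp(-0.5 * sum(c * c for c in u)) / (2.0 * math.pi) ** (len(u) / 2.0)

def density_estimate(z0, points, p, w, h):
    """Weighted kernel density estimate at the L-dimensional point z0."""
    L = len(z0)
    total = 0.0
    for t, p_i, w_i in zip(points, p, w):
        u = [(tc - zc) / h for tc, zc in zip(t, z0)]
        total += p_i * w_i * gaussian_kernel(u)
    return total / h ** L

def predict(x0, points, p, w, h):
    """Estimate E(y | x0) as a weighted average of the observed responses.

    Each point is (x_1, ..., x_{L-1}, y); the weight of y_i is proportional
    to the estimated joint density at (x0, y_i).
    """
    ys = [t[-1] for t in points]
    dens = [density_estimate(list(x0) + [y], points, p, w, h) for y in ys]
    return sum(y * d for y, d in zip(ys, dens)) / sum(dens)

# Toy data (hypothetical): four pooled points on the line y = x + 1,
# equal jumps, and tilts equal to 1 (the reference group).
points = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
p = [0.25] * 4
w = [1.0] * 4
pred = predict([1.5], points, p, w, h=1.0)
print(pred)
```

Because the weights are nonnegative and sum to 1, the prediction always lies between the smallest and largest observed response; in this symmetric toy configuration it sits at the midpoint of the responses.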

2.2 Diagnostic Plots and Measures of Goodness-of-Fit 

The density ratio model motivates graphical and quantitative diagnostic
tools for measuring both goodness-of-fit of the model and the quality of the
regression (13). Goodness-of-fit tests have been proposed by Gilbert (2004),
Qin and Zhang (1997), and Zhang (1999, 2001, 2002), where the
appropriateness of the model is judged by the closeness of the estimated
reference distribution to the corresponding empirical distribution. Bondell
(2007) suggests a reformulation of this in terms of the corresponding kernel
density estimates. We suggest data analytic tools to measure discrepancies
stemming from all case and control (reference) groups.

Graphical evidence of goodness-of-fit can be obtained from the plots of
$\hat G_i$ versus the corresponding empirical multivariate distribution
function $\tilde G_i$, $i = 1,\ldots,q,m$, evaluated at some selected
L-dimensional points so as to obtain two-dimensional plots. Figures 1 and 2
in the next section are examples of this.



We found the following measure of goodness-of-fit useful. Consider the
ith sample of size $n_i$, and let x be the number of times the estimated
semiparametric cdf falls in the $1 - \alpha$ confidence interval obtained from
the corresponding empirical cdf, both evaluated at the sample points. Define

$$R^2_{\alpha,k} = 1 - \exp\left\{-\left(\frac{x}{n_i - x}\right)^{k}\right\} \tag{15}$$

where k > 0. Observe that:

• $R^2_{\alpha,k}$ takes values between 0 and 1, being close to 1 when x
approaches $n_i$ and close to 0 when x is close to 0.

• $R^2_{\alpha,k}$ is a flexible criterion that can be adjusted by changing
the parameters $\alpha$ and k.

• Computing $R^2_{\alpha,k}$ is both simple and fast.
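
The measure is a one-line computation once the count x is available. The sketch below takes x as given (how the confidence interval around the empirical cdf is constructed is not specified here) and treats the boundary case $x = n_i$ as the limiting value 1; the function name and the sample size used in the example are illustrative.

```python
import math

def r2_alpha_k(x, n_i, k=2.0):
    """Coverage-based goodness-of-fit measure 1 - exp(-(x / (n_i - x))**k).

    x   : number of sample points at which the semiparametric cdf falls in
          the 1 - alpha confidence interval built from the empirical cdf
    n_i : size of the i-th sample
    k   : positive tuning exponent
    """
    if not 0 <= x <= n_i:
        raise ValueError("x must lie between 0 and n_i")
    if k <= 0:
        raise ValueError("k must be positive")
    if x == n_i:            # x / (n_i - x) is infinite, so the measure is 1
        return 1.0
    return 1.0 - math.exp(-((x / (n_i - x)) ** k))

# Illustrative values for a sample of size 40 and k = 2.
for x in (0, 10, 20, 30, 40):
    print(x, r2_alpha_k(x, 40, k=2.0))
```

The measure increases with x, and a larger k makes the transition from values near 0 to values near 1 sharper around $x = n_i/2$.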

We describe next three natural alternatives to $R^2_{\alpha,k}$. First, as in
multiple regression, goodness-of-fit may be approached by residual analysis.
In this vein, consider the decomposition

$$E[y - E(y)]^2 = E[y - E(y \mid x)]^2 + E[E(y \mid x) - E(y)]^2.$$

Therefore, from the approximations $\bar y \approx E(y)$ and
$\hat y = \hat E(y \mid x)$,

$$\frac{1}{n}\sum (y_i - \bar y)^2 \approx \frac{1}{n}\sum (y_i - \hat y_i)^2 + \frac{1}{n}\sum (\hat y_i - \bar y)^2,$$

we define $R_1^2$, labeled (16), as in linear regression.

Next, define

$$R_2^2 = \mathrm{corr}(y, \hat y)^2. \tag{17}$$

Lastly, following Qin and Zhang (1997), define

$$R_3^2 = \exp\big(-\sqrt{n}\cdot\max|\hat G_i - \tilde G_i|\big). \tag{18}$$

Clearly, $R_3^2$ takes values between 0 and 1. Alternatives to $R_3^2$ are
obtained by replacing the maximum with the median of
$|\hat G_i - \tilde G_i|$ or with a sum of powers of $|\hat G_i - \tilde G_i|$.

The following simulation study suggests that $R^2_{\alpha,k}$ is a more
practical indicator of goodness-of-fit than $R_1^2$, $R_2^2$, and $R_3^2$.
We note that the coefficient of determination suggested in Nagelkerke (1991)
was not found sufficiently sensitive in the present context.
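
Two of the alternative measures, the squared correlation between observed and predicted responses and the measure based on the largest cdf discrepancy, are straightforward to compute. The sketch below is a minimal illustration with hypothetical inputs; the function names are assumptions, and the sample size n is passed explicitly rather than inferred.

```python
import math

def r2_corr(y, y_hat):
    """Squared sample correlation between observed and predicted responses."""
    n = len(y)
    my, mh = sum(y) / n, sum(y_hat) / n
    sxy = sum((a - my) * (b - mh) for a, b in zip(y, y_hat))
    sxx = sum((a - my) ** 2 for a in y)
    syy = sum((b - mh) ** 2 for b in y_hat)
    return sxy * sxy / (sxx * syy)

def r2_sup(G_hat, G_emp, n):
    """exp(-sqrt(n) * max |G_hat - G_emp|) over the evaluation points."""
    d = max(abs(a - b) for a, b in zip(G_hat, G_emp))
    return math.exp(-math.sqrt(n) * d)

# Hypothetical values at four evaluation points.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
G_hat = [0.10, 0.40, 0.70, 1.00]
G_emp = [0.12, 0.38, 0.71, 1.00]

print(r2_corr(y, y_hat))
print(r2_sup(G_hat, G_emp, n=100))
```

Because the second measure depends on a single maximum scaled by $\sqrt{n}$, one badly fitted point in a large sample is enough to pull it toward 0.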



3 Some Simulation Results 

In the present simulation study m = 2, and $g_2$ denotes the reference
distribution. We considered the following bivariate cases (runs).

1. $g_1 \sim N((0,0)', \Sigma)$, $g_2 \sim N((0,0)', \Sigma)$ with a common
covariance matrix $\Sigma$, $n_1 = 40$, $n_2 = 30$.

2. $g_1 \sim N((0,0)', \Sigma)$, $g_2 \sim N((1,1)', \Sigma)$ with a common
covariance matrix $\Sigma$, and $n_2 = 200$.

3. $g_1$ from the standard two-dimensional multivariate Cauchy and $g_2$ from
a two-dimensional multivariate Cauchy with $\mu = (1,1)'$ and scale matrix
$V$, $n_1 = 200$, $n_2 = 200$.

4. $g_1$ from the standard two-dimensional multivariate Cauchy and $g_2$ from
the uniform distribution on the triangle (0, 0), (6, 0), (-3, 4), and
$n_1 = 200$, $n_2 = 200$.

The normal distribution follows the density ratio model, but this is not
true for the Cauchy and the uniform distributions. Hence we expect to see
straight lines in the diagnostic plots and high $R^2$'s, as defined above, in
cases (1) and (2). On the other hand, we expect to see deviations from
straight lines in the diagnostic plots and lower $R^2$'s in cases (3) and (4).

Figures 1-2 show the estimated $\hat G_1$ and $\hat G_2$ (where $\hat G_1$ is
the exponential tilt of $\hat G_2$) versus the empirical cdfs $\tilde G_1$ and
$\tilde G_2$, respectively, all obtained from the simulated case-control data,
and evaluated at selected 2D points. As expected, in cases (1) and (2) there
is almost perfect agreement between $\hat G_i$ and $\tilde G_i$, i = 1, 2,
whereas Figure 2 shows clearly that the density ratio model is not
appropriate for the data from cases (3) and (4).

A comparison of the different measures of goodness of fit is given in Table
1. Apparently here $R_1^2$ and $R_2^2$ are misleading as measures of goodness
of fit: they are erroneously higher in the cases where the simulated
distributions do not follow the density ratio model. It seems $R_3^2$ is more
appropriate than both $R_1^2$ and $R_2^2$, but it is sensitive to outliers and
can give low values even for data that follow the density ratio model (e.g.
case 2). On the other hand, the proposed measure $R^2_{\alpha,k}$ classifies
the four cases correctly, giving high values for runs (1) and (2) and low
values for (3) and (4). The values of $R^2_{\alpha,k}$ in Table 1 were
calculated with k = 2 and $1 - \alpha$ = 90%. In general, $R^2_{\alpha,k}$
gets closer to $R_3^2$ by lowering $1 - \alpha$.

Figure 1: Case-control plots of $\hat G_i$ vs. $\tilde G_i$, i = 1, 2, cases (1) and (2).

Figure 2: Case-control plots of $\hat G_i$ vs. $\tilde G_i$, i = 1, 2, cases (3) and (4).



Table 1: Comparison of goodness of fit measures for case and control.

Run   Group   R_1^2    R_2^2    R_3^2    R^2_{10,2}
(1)   Case    0.1947   0.6196   0.7702   1
      Ctrl    0.3123   0.4812   0.7422   1
(2)   Case    0.0290   0.0470   0.3281   0.9998
      Ctrl    0.1214   0.2356   0.3651   0.9999
(3)   Case    0.6948   0.8441   0.1390   0.1469
      Ctrl    0.6792   0.7537   0.1294   0.1219
(4)   Case    0.4978   0.5662   0.0340   0.0999
      Ctrl    0.4277   0.4372   0.0305   0.0001



Figure 3 shows the estimated E[Y | X] using equation (13) for the first
two cases. The prediction line is apparently influenced by the endpoints but
otherwise it is a smooth curve. Superimposed is the line obtained from
simple linear regression. From Table 2, except in one case, the semiparametric
prediction gives lower MSE and MAE than linear regression.
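
For reference, the two error summaries used in these comparisons are the average squared and the average absolute prediction error. The sketch below is a minimal illustration; the function names and the toy numbers are hypothetical and are not taken from the simulations.

```python
def mse(y, y_hat):
    """Mean square error: average of (y_i - y_hat_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mae(y, y_hat):
    """Mean absolute error: average of |y_i - y_hat_i|."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

# Toy (hypothetical) observed and predicted responses.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
print(mse(observed, predicted), mae(observed, predicted))
```

MSE penalizes a few large misses more heavily than MAE does, which is why the two criteria can rank methods differently, as happens in one row of Table 2.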

Table 2: Comparison of the semiparametric regression (13) and linear
regression for simulations (1) and (2) in terms of MSE and MAE.

                     MSE                             MAE
                     Semi. Prediction  Linear Reg.   Semi. Prediction  Linear Reg.
Simulation 1   G_1   1.4534            1.3251        0.8583            0.8758
               G_2   1.1540            1.3363        0.8640            0.9693
Simulation 2   G_1   0.8187            0.8411        0.7224            0.7302
               G_2   1.2909            1.4825        0.9820            0.9919

4 Application to Testicular Germ Cell Cancer 

Testicular germ cell tumor (TGCT) is a common cancer among U.S. men,
mainly in the age group of 15-35 years (McGlynn et al 2003). In McGlynn
et al (2007) it was shown that increased risk was significantly related
to height, whereas body mass index was not significant. In Kedem et al
(2009), using the two-dimensional semiparametric model, it was shown that
jointly height and weight are significant risk factors. The TGCT data
consist of age (years), height (cm) and weight (kg) of 1691 individuals, of
which $n_1 = 763$ are cases and $n_2 = 928$ belong to the control group. In
the present application m = 2 and the chosen bandwidth is h = 0.3. We use the
semiparametric regression (13) to predict weight given the covariates age and
height for both the case and the control groups, holding the control
distribution as reference. The measures of goodness of fit discussed in
Section 2.2 are applied to assess the model and the predictions, and the
results are compared with those from multiple regression and the
Nadaraya-Watson regression.

Figure 3: Comparison of E[Y | X] in (13) obtained from the density ratio
model and from linear regression for the 1st and 2nd simulations.

Before applying the three-dimensional density ratio model to the TGCT
data, it is interesting to apply the two-dimensional model to get a
prediction of weight given only height. As Figure 4 shows, the density ratio
model is a suitable model for the TGCT data: there is almost perfect
agreement between the plots of the estimated semiparametric $\hat G_i$ and the
corresponding empirical $\tilde G_i$, i = 1, 2. Figure 5 shows the estimated
E[Y | X] using equation (13) for the case and control groups. Superimposed is
the regression line obtained from linear regression under the normal
assumption. The residual plots in Figure 6 are centered around zero. The MSE
values for control are 89.33101 for the semiparametric model and 92.26375 for
the regression model; for case the MSE values are 99.10386 (semiparametric)
and 99.51023 (regression). The corresponding MAE values are 7.212146
(semiparametric) and 7.295542 (regression) for control, and 7.801397
(semiparametric) and 7.784165 (regression) for case. The value of
$R^2_{20,1}$ is 1 for both case and control.

Figure 4: 2D problem: Plots of $\hat G_i$ versus $\tilde G_i$, i = 1, 2,
evaluated at (height, weight) pairs for the case and control groups from the
TGCT data.

Figure 5: Comparison of E[weight | height] in (13) and linear regression.

Figure 6: Residual plots for the TGCT case and control groups from
E[weight | height] in (13). Panels: residual plot for case; residual plot for
control.

From the preceding results, in the 2D problem the two models give similar
results. However, from Table 3, the introduction of the covariate age
results in a significant reduction in MSE and MAE from the semiparametric
model (for example, the control MSE drops from 89.33 to 73.97 and the case
MSE from 99.10 to 77.09), whereas the multiple regression MSE and MAE stay
almost unchanged (control MSE 92.26 versus 90.29, case MSE 99.51 versus
96.37).

Tables 4 and 5 give some predicted values for weight given age and height
for the two models. These, the diagnostic plots in Figures 7 and 8, and the
fact that $R^2_{20,1} = 1$ for both case and control tell us that the
semiparametric model is quite appropriate for the TGCT data.

We end this section by noting that, as expected, $\hat E(y \mid x)$ in (13)
tends to be close to the average of the y's which correspond to the same x.
This is demonstrated in Table 6, which gives the case-control weight
predictions (13) and the actual weights. Empty entries in the table
correspond to subjects with the same height and age (i.e. same x), but
possibly different weights. The averaging property can be seen by averaging
the run of weights in the "empty cells" and the run upper bound. Thus, for
example, the control weights corresponding to age 22 and height 175.26
average to 74.3894 and the prediction is 75.67005.
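
The arithmetic behind this example can be reproduced directly from the five control weights listed in Table 6 for age 22 and height 175.26. The snippet below is only a check of that average; the variable names are illustrative.

```python
from statistics import mean

# Control weights (kg) listed in Table 6 for age 22 and height 175.26 cm.
control_weights = [77.111, 65.771, 79.379, 83.915, 65.771]
prediction = 75.67005   # semiparametric prediction reported for this row

avg = mean(control_weights)
print(round(avg, 4))           # average of the observed weights
print(prediction - avg)        # gap between prediction and average
```

The prediction is not expected to match the group average exactly, since the kernel weights also draw on neighboring (age, height) values and on the pooled case and control data.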
Table 3: Case-control MSE and MAE for weight given height and age.

                         Case                    Control
                         MSE        MAE          MSE        MAE
Semiparametric model     77.09442   6.794231     73.97413   6.45978
Multiple regression      96.36657   7.678669     90.29083   7.24368

Table 4: Predicted case values of weight given height and age.

Case
Age   Height   Weight    Semiparametric   Mult. Regression
26    193.04   102.058   99.09662         92.47554
24    167.64   72.575    71.28688         70.00329
29    180.34   65.771    81.45212         82.42360
38    185.42   81.647    85.17266         89.46406
34    195.58   89.811    86.53122         97.03194
27    162.56   58.967    59.08479         66.51540

Table 5: Predicted control values of weight given height and age.

Control
Age   Height   Weight    Semiparametric   Mult. Regression
29    180.34   90.718    81.32563         82.06293
39    175.26   77.111    76.02246         80.36549
19    172.72   63.503    72.09521         73.58821
33    177.80   83.915    84.51664         80.97707
31    190.50   102.058   90.09859         90.67494
25    165.10   58.967    59.11650         68.90777

Figure 7: Case-control plots of $\hat G_i$ versus $\tilde G_i$, i = 1, 2, for
the 3D TGCT problem: the $\hat G_i$, $\tilde G_i$ are evaluated at selected
(age, height, weight) triplets.

Figure 8: Residual plots from the regression of weight given height and age
using (13) for the case and control groups from the TGCT data.

Table 6: Case-control weight and E[weight | height, age].

               Control                  Case
Age   Height   Weight    E[W | H,A]     Weight    E[W | H,A]
27    162.56   58.967    59.08212       58.967    59.08479
28    162.56   77.111    69.85564       65.771    69.90699
               68.039
30    165.10   68.039    70.01298       72.575    70.02422
37    165.10   69.40     66.70674       63.503    66.72657
25    167.64   86.183    77.18912       72.575    77.4467
                                        90.718
                                        63.503
30    167.64   72.575    80.2309        88.451    80.37712
18    170.18   61.235    67.6877        72.575    67.7608
32    170.18   70.307    72.34239       81.647    72.46667
               63.503
37    172.72   74.843    80.73858       88.451    80.84287
40    172.72   70.307    77.72428       90.718    77.8726
               77.111
22    175.26   77.111    75.67005       86.183    75.8382
               65.771                   65.771
               79.379                   86.183
               83.915
               65.771
25    175.26   68.039    75.74933       79.379    75.86762
               83.915                   72.575
               74.843                   83.915
               83.915                   74.843
               79.379                   72.575
               86.183                   74.843
                                        61.235
                                        61.235
                                        65.771
                                        79.379
26    177.80   79.379    78.7966        77.111    78.9987
               81.647                   104.326
               58.967                   77.111


81.647 
79.379 
74.843 
88.451 
68.039 



               Control                  Case
Age   Height   Weight    E[W | H,A]     Weight    E[W | H,A]
42    177.80   70.307    78.36252       91.626    78.60568
20    180.34   79.832    76.52451       84.368    76.60789
               65.771                   68.039
               77.111                   79.379
               79.379                   81.647
                                        72.575
33    180.34   79.379    79.5773        77.111    79.58786
                                        81.647
18    182.88   77.111    72.1744        68.039    72.22253
41    182.88   79.379    82.39921       86.183    82.42601
19    185.42   63.503    72.83926       68.039    73.14691
                                        94.347
                                        68.039
21    185.42   86.183    84.16899       79.379    84.40131
               72.575                   77.111
               102.058                  97.522
22    190.50   97.522    85.90334       86.183    86.15233
               95.254                   71.668
31    190.50   102.058   90.09859       104.326   90.54886
                                        74.843
22    193.04   86.183    87.78239       102.058   87.94713
                                        80.739
24    193.04   99.337    99.43246       108.862   99.58426
               86.183
               99.790
               108.862
34    193.04   113.398   104.4838       88.451    104.8948
                                        117.934
34    195.58   83.915    86.50797       89.811    86.53122


4.1 Comparison With Nadaraya-Watson

The Nadaraya-Watson (NW) kernel estimator of E(y | x) is a weighted average
$\sum_i w_i y_i$ where the $w_i$ are large for $x_i$ close to x and
small for $x_i$ farther away from x. Thus, nearest neighbors are counted but
$x_i$ whose distance from x is relatively large are discounted (Lee 1996, p.
144). From experience, for sufficiently condensed data the NW estimator
and (13) are quite comparable. Thus, it is interesting to compare (13)
with the Nadaraya-Watson estimator for troublesome scenarios when x is
an extreme case with few or no neighbors.
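
A minimal sketch of a Nadaraya-Watson estimator makes the difficulty with isolated points concrete. It assumes a Gaussian kernel with one common bandwidth for all covariates; the function name and the toy data are hypothetical, and this is not the R routine used for the results reported here.

```python
import math

def nadaraya_watson(x0, xs, ys, h):
    """Kernel-weighted average of ys with Gaussian weights centered at x0.

    xs is a list of covariate tuples, ys the matching responses, and h a
    common bandwidth. Raises ValueError when every weight underflows to 0,
    i.e. when x0 has no neighbors on the scale of h.
    """
    weights = []
    for x in xs:
        d2 = sum(((a - b) / h) ** 2 for a, b in zip(x, x0))
        weights.append(math.exp(-0.5 * d2))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations within reach of x0 for this bandwidth")
    return sum(w * y for w, y in zip(weights, ys)) / total

# Toy one-covariate data (hypothetical).
xs = [(0.0,), (1.0,), (2.0,)]
ys = [1.0, 2.0, 3.0]

print(nadaraya_watson((1.0,), xs, ys, h=1.0))   # interior point
try:
    nadaraya_watson((100.0,), xs, ys, h=1.0)    # far from all observations
except ValueError as err:
    print("extreme point:", err)
```

At an interior point the estimate is a well-behaved local average, while at a point far from all observations the weights vanish numerically and the estimate is either undefined or driven by whichever observation happens to be nearest.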

For such a comparison, we identified 15 extreme (age, height) case pairs
and 15 control pairs, and tried to predict the corresponding weights from
the rest (or "middle") of the TGCT data. Accordingly, the semiparametric
model parameters were estimated with $n_1 = 40$ values from case and
$n_2 = 50$ values from control, giving a total of n = 90 observations.

We encountered a technical difficulty when computing the NW estimator
using R, as the default value for the control bandwidth for age was huge,
168,395,326. However, for case the default bandwidths were 5.432816 for
age and 4.25757 for height. Thus, to conform to these last bandwidths,
we fitted the semiparametric model once with bandwidth equal to 4 and
once with bandwidth equal to 5, and report only case results in Table 7. It
should be noted that the NW estimator was computed with $n_1 = 40$ case
observations, whereas (13) was computed from a combined sample with
$n_1 + n_2 = 90$ case and control observations. From Table 7 it seems that
the semiparametric prediction is somewhat more immune to extremes than
NW (MSE of 101.543 and 90.525 versus 119.984 for NW).

Table 7: Case MSE and MAE for E[weight | height, age].

                                     MSE       MAE
Semiparametric model with bw=4       101.543   9.222
Semiparametric model with bw=5       90.525    8.657
Multiple Regression                  109.729   8.848
Nadaraya-Watson                      119.984   10.252



5 Summary 

In this paper we have demonstrated that the L-dimensional density ratio 
model is useful in estimating the conditional expectation of a response vari- 
able given random covariates when multiple data sources are available. In 
addition we have suggested overall qualitative (graphical) and quantitative 
validation measures to assess the suitability of the method. The simulation 
results, the analysis of the TGCT data, and the comparison with multiple
regression and the Nadaraya-Watson estimator point to the merit of the
method, at least for a small number of covariates.

The approach offers a way of understanding how multivariate distributions
representing many different data sources are related to each other.
This leads to a ramification of the notion of regression where the objective 
is to model relationships between distributions. Relationships between re- 
sponse variables and their covariates, corresponding to the data sources, are 
byproducts. 

Finally, we note that the suggested validation measures may shed light
on the selection of $h(x)$ in (2). This idea was not dealt with in the present
paper.

Appendix: Asymptotic Results 

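Computing S, V

Define

$$\nabla = \left(\frac{\partial}{\partial\alpha_1},\ldots,\frac{\partial}{\partial\alpha_q},\frac{\partial}{\partial\beta_1},\ldots,\frac{\partial}{\partial\beta_q}\right)'.$$

Then $E[\nabla l(\theta)] = E[\nabla l(\alpha_1,\ldots,\alpha_q,\beta_1,\ldots,\beta_q)] = 0$. Let

$$E_j(t) = \int t\,w_j(t)\,dG(t),$$

$$A_0(j,j') = \int\frac{w_j(t)w_{j'}(t)}{1 + \sum_{k=1}^{q}\rho_k w_k(t)}\,dG(t),$$

$$A_1(j,j') = \int\frac{w_j(t)w_{j'}(t)\,t}{1 + \sum_{k=1}^{q}\rho_k w_k(t)}\,dG(t),$$

$$A_2(j,j') = \int\frac{w_j(t)w_{j'}(t)\,tt'}{1 + \sum_{k=1}^{q}\rho_k w_k(t)}\,dG(t),$$

$j, j' = 1,\ldots,q$. The entries in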




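$V = \mathrm{Var}\left[\frac{1}{\sqrt{n}}\nabla l(\alpha_1,\ldots,\alpha_q,\beta_1,\ldots,\beta_q)\right]$
are, with $w_m \equiv 1$ and $\rho_m = 1$,

$$\frac{1}{n}\mathrm{Var}\left(\frac{\partial l}{\partial\alpha_j}\right) = \frac{\rho_j^2}{1 + \sum_{k=1}^{q}\rho_k}\left[A_0(j,j) - \sum_{r=1}^{m}\rho_r A_0^2(j,r)\right]$$

$$\frac{1}{n}\mathrm{Cov}\left(\frac{\partial l}{\partial\alpha_j},\frac{\partial l}{\partial\alpha_{j'}}\right) = \frac{\rho_j\rho_{j'}}{1 + \sum_{k=1}^{q}\rho_k}\left[A_0(j,j') - \sum_{r=1}^{m}\rho_r A_0(j,r)A_0(j',r)\right]$$

$$\frac{1}{n}\mathrm{Cov}\left(\frac{\partial l}{\partial\alpha_j},\frac{\partial l}{\partial\beta_{j'}}\right) = \frac{\rho_j\rho_{j'}}{1 + \sum_{k=1}^{q}\rho_k}\left[A_0(j,j')E_{j'}(t') - \sum_{r=1}^{m}\rho_r A_0(j,r)A_1'(j',r)\right]$$

$$\frac{1}{n}\mathrm{Cov}\left(\frac{\partial l}{\partial\beta_j},\frac{\partial l}{\partial\beta_{j'}}\right) = \frac{\rho_j\rho_{j'}}{1 + \sum_{k=1}^{q}\rho_k}\left[-A_2(j,j') + E_j(t)A_1'(j,j') + A_1(j,j')E_{j'}(t') - \sum_{r=1}^{m}\rho_r A_1(j,r)A_1'(j',r)\right] + \frac{1}{n}\sum_{k=1}^{n_j}\sum_{k'=1}^{n_{j'}}\mathrm{Cov}\left[(x_{jk1},\ldots,y_{jk})', (x_{j'k'1},\ldots,y_{j'k'})'\right] \tag{19}$$

The last term is zero for $j \ne j'$.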

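As $n \to \infty$,

$$-\frac{1}{n}\nabla\nabla' l(\alpha_1,\ldots,\alpha_q,\beta_1,\ldots,\beta_q) \to S \tag{20}$$

where S is a $q(1+L) \times q(1+L)$ matrix with entries corresponding to
$j, j' = 1,\ldots,q$, $j \ne j'$:

$$-\frac{1}{n}\frac{\partial^2 l}{\partial\alpha_j^2} \to \frac{\rho_j}{1 + \sum_{k=1}^{q}\rho_k}\int\frac{w_j(t)\left[1 + \sum_{k\ne j}\rho_k w_k(t)\right]}{1 + \sum_{k=1}^{q}\rho_k w_k(t)}\,dG(t)$$

$$-\frac{1}{n}\frac{\partial^2 l}{\partial\alpha_j\partial\alpha_{j'}} \to \frac{-\rho_j\rho_{j'}}{1 + \sum_{k=1}^{q}\rho_k}\int\frac{w_j(t)w_{j'}(t)}{1 + \sum_{k=1}^{q}\rho_k w_k(t)}\,dG(t)$$

$$-\frac{1}{n}\frac{\partial^2 l}{\partial\alpha_j\partial\beta_j'} \to \frac{\rho_j}{1 + \sum_{k=1}^{q}\rho_k}\int\frac{w_j(t)\,t'\left[1 + \sum_{k\ne j}\rho_k w_k(t)\right]}{1 + \sum_{k=1}^{q}\rho_k w_k(t)}\,dG(t)$$

$$-\frac{1}{n}\frac{\partial^2 l}{\partial\alpha_j\partial\beta_{j'}'} \to \frac{-\rho_j\rho_{j'}}{1 + \sum_{k=1}^{q}\rho_k}\int\frac{w_j(t)w_{j'}(t)\,t'}{1 + \sum_{k=1}^{q}\rho_k w_k(t)}\,dG(t)$$

$$-\frac{1}{n}\frac{\partial^2 l}{\partial\beta_j\partial\beta_j'} \to \frac{\rho_j}{1 + \sum_{k=1}^{q}\rho_k}\int\frac{w_j(t)\,tt'\left[1 + \sum_{k\ne j}\rho_k w_k(t)\right]}{1 + \sum_{k=1}^{q}\rho_k w_k(t)}\,dG(t)$$

$$-\frac{1}{n}\frac{\partial^2 l}{\partial\beta_j\partial\beta_{j'}'} \to \frac{-\rho_j\rho_{j'}}{1 + \sum_{k=1}^{q}\rho_k}\int\frac{w_j(t)w_{j'}(t)\,tt'}{1 + \sum_{k=1}^{q}\rho_k w_k(t)}\,dG(t)$$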

Consistency of the Semiparametric Regression

Let x be a vector of size k = L - 1 of bounded covariates, and y a bounded
response. We wish to prove $\hat E(y|x) \to E(y|x)$. We have:

$$\hat E(y|x) - E(y|x) = \frac{\frac{1}{n}\sum_{i=1}^{n} y_i\,\hat g(x,y_i)}{\frac{1}{n}\sum_{i=1}^{n}\hat g(x,y_i)} - \frac{\int y\,g(x,y)\,dy}{g(x)}.$$

Or for sufficiently large n

$$\hat E(y|x) - E(y|x) \approx \frac{\int y\,\hat g(x,y)\,dy}{g(x)} - \frac{\int y\,g(x,y)\,dy}{g(x)} = \frac{\int y[\hat g(x,y) - g(x,y)]\,dy}{g(x)}.$$

Thus by Cauchy-Schwarz,

$$\left[\int|\hat E(y|x) - E(y|x)|\,g(x)\,dx\right]^2 = \left[\int\left|\int y[\hat g(x,y) - g(x,y)]\,dy\right|dx\right]^2 \le \left[\iint|y|\,|\hat g(x,y) - g(x,y)|\,dy\,dx\right]^2 \le \iint|y|^2\,dy\,dx\iint|\hat g(x,y) - g(x,y)|^2\,dy\,dx \le c\iint|\hat g(x,y) - g(x,y)|^2\,dy\,dx.$$

From Fokianos (2004) and Qin and Zhang (2005) the MISE converges to 0.
That is

$$E\left[\iint|\hat g(x,y) - g(x,y)|^2\,dy\,dx\right] \to 0$$

as $nh \to \infty$ and $h \to 0$. It follows that

$$\left\{\int E|\hat E(y|x) - E(y|x)|\,g(x)\,dx\right\}^2 \le E\left[\int|\hat E(y|x) - E(y|x)|\,g(x)\,dx\right]^2 \to 0$$

and we have the convergence in mean $E|\hat E(y|x) - E(y|x)| \to 0$, which
implies convergence in probability.